
Conversation

@GarrettWu (Contributor) commented Dec 4, 2024

Patches the JSON output to use the BQ JSON type instead of STRING, which avoids the JSON destination table error in b/381148539 and unblocks the Multimodal integrations.

Our current implementation converts JSON to STRING entirely when reading into BigFrames. This change uses pd.ArrowDtype(pa.large_string()) as the BigFrames dtype and pa.large_string() as the pyarrow dtype, passes that information through BigFrames to determine the output type, and parses the STRING back to JSON at the end.

This is a workaround layered on the existing JSON-as-STRING workaround. We should rework JSON support once pyarrow ships a native JSON type.
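
For illustration, here is a minimal sketch of the dtype pairing described above. The constant names are assumptions (JSON_DTYPE follows the bigframes.dtypes.JSON_DTYPE reference quoted later in this review); this is not the PR's verbatim source.

```python
# Hedged sketch, not the actual BigFrames source: JSON columns are
# carried through BigFrames as a large_string-backed Arrow dtype and
# parsed back to BQ JSON on output.
import pandas as pd
import pyarrow as pa

# BigFrames-side dtype for JSON columns (name per the
# bigframes.dtypes.JSON_DTYPE reference in the review below).
JSON_DTYPE = pd.ArrowDtype(pa.large_string())

# pyarrow-side dtype used for the same columns (hypothetical name).
JSON_ARROW_TYPE = pa.large_string()
```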

@GarrettWu GarrettWu self-assigned this Dec 4, 2024
@GarrettWu GarrettWu requested review from a team as code owners December 4, 2024 01:43
@product-auto-label product-auto-label bot added size: s Pull request size is small. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels Dec 4, 2024
@GarrettWu GarrettWu marked this pull request as draft December 4, 2024 20:47
@product-auto-label product-auto-label bot added size: m Pull request size is medium. and removed size: s Pull request size is small. labels Dec 5, 2024
@GarrettWu GarrettWu marked this pull request as ready for review December 6, 2024 18:46
@tswast (Collaborator) commented Dec 10, 2024

Question: Does this change anything with regard to I/O? For example, I notice that with read_csv on the pandas-y code path, the DataFrame from pandas has string columns as large_string() in my testing. Hopefully such columns don't end up as JSON in BigQuery now.

@GarrettWu (Contributor, Author) commented Dec 10, 2024

> Question: Does this change anything with regard to I/O? For example, I notice that with read_csv on the pandas-y code path, the DataFrame from pandas has string columns as large_string() in my testing. Hopefully such columns don't end up as JSON in BigQuery now.

How did you test? In my testing, a pandas DataFrame with the pa.large_string dtype is interpreted as a normal string. It shouldn't affect input as large_string, since we convert it to a normal string in our I/O. Screenshots below, followed by a sketch of the scenario.

[Screenshot] screen/763yZzyywmEVWbR

[Screenshot: destination table] screen/4u6scyqtczXSAK8
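
A rough sketch of that scenario (hedged: read_pandas here stands in for the I/O path; this is not the exact test code):

```python
# Hedged sketch, not the exact test: a pandas DataFrame whose column uses
# the large_string Arrow dtype should still land in BigQuery as STRING,
# because BigFrames I/O converts large_string back to a normal string.
import pandas as pd
import pyarrow as pa
import bigframes.pandas as bpd

pdf = pd.DataFrame({"col": ['{"a": 1}', '{"b": 2}']}).astype(
    pd.ArrowDtype(pa.large_string())
)
bdf = bpd.read_pandas(pdf)  # ingested as a normal STRING column, not JSON
```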

@tswast (Collaborator) commented Dec 11, 2024

> How did you test?

It was part of my debugger step-through of #371, but it's possible I accidentally changed something in the I/O path there.

@GarrettWu GarrettWu enabled auto-merge (squash) December 12, 2024 22:24
@GarrettWu GarrettWu merged commit 200c9bb into main Dec 12, 2024
22 checks passed
@GarrettWu GarrettWu deleted the garrettwu-json branch December 12, 2024 22:25
```diff
@@ -107,7 +107,7 @@ def from_table(
             raise ValueError("must set at most one of 'offests', 'primary_key'")
         if any(i.field_type == "JSON" for i in table.schema if i.name in schema.names):
             warnings.warn(
-                "Interpreting JSON column(s) as StringDtype. This behavior may change in future versions.",
+                "Interpreting JSON column(s) as StringDtype and pyarrow.large_string. This behavior may change in future versions.",
```
Review comment (Contributor):
Shouldn't we just say "... as pyarrow.large_string"? StringDtype here is not relevant anymore, no?

```diff
@@ -2564,13 +2564,13 @@ def _get_rows_as_json_values(self) -> Block:
                 ),
                 T1 AS (
                     SELECT *,
-                        JSON_OBJECT(
+                        TO_JSON_STRING(JSON_OBJECT(
```
Review comment (Contributor):
Do we need an explicit TO_JSON_STRING? Doesn't the change in ibis_types.py::ibis_dtype_to_bigframes_dtype take care of this post-read?

```python
if isinstance(ibis_dtype, ibis_dtypes.JSON):
    ...
    return bigframes.dtypes.JSON_DTYPE
```
